FIGURE 5.8
Outliers over X̃, γ, and X of LayerNorm on BERT-SST-2. For example, at dimension 308, both γ and X̃ have sharper values; once γ is excluded, X exhibits a milder distribution than X̃.

FIGURE 5.9
The value (mean + 3 * std) is drawn as the left border, and candidate values for clipping the tensor are then enumerated on RoBERTa-QNLI. The plot also reflects the proportion of clipped tokens.

5.6.2 Gamma Migration

Specifically, gamma migration produces a more quantization-friendly model by migrating the outlier amplifier γ into subsequent modules via an equivalent transformation, yielding activations that are more robust to quantization without any extra computation burden. As shown in Fig. 5.10, γ is excluded from the LayerNorm and moved to the shortcut branch and the weight of the next layer. As a result, the LayerNorm becomes the Non-scaling LayerNorm, while the shortcut branch and the weight of the next layer absorb the new parameter γ.
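
To make the transformation concrete, the sketch below folds the γ of a standard nn.LayerNorm into the following linear layer and returns it as a separate factor for the shortcut branch. It is a minimal PyTorch sketch, not the chapter's actual implementation: the names NonScalingLayerNorm and migrate_gamma are illustrative, and it assumes the LayerNorm output feeds one linear layer plus one shortcut branch and that γ contains no zeros.

```python
import torch
import torch.nn as nn

class NonScalingLayerNorm(nn.Module):
    """LayerNorm with the scale γ removed; the bias becomes β/γ, so multiplying
    the output by γ afterwards reproduces the original LayerNorm exactly."""
    def __init__(self, ln: nn.LayerNorm):
        super().__init__()
        self.eps = ln.eps
        # New bias β' = β / γ (assumes γ has no zero entries).
        self.bias = nn.Parameter(ln.bias.detach() / ln.weight.detach())

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mu) / torch.sqrt(var + self.eps) + self.bias

def migrate_gamma(ln: nn.LayerNorm, next_linear: nn.Linear):
    """Fold γ into next_linear's weight (in place) and return the Non-scaling
    LayerNorm together with the γ kept for the shortcut branch."""
    gamma = ln.weight.detach()
    with torch.no_grad():
        # (γ ⊙ x) @ W.T == x @ (W * γ).T, so γ is absorbed along in_features.
        next_linear.weight.mul_(gamma)
    return NonScalingLayerNorm(ln), gamma.clone()
```

In full precision the block stays numerically equivalent: the Non-scaling LayerNorm output, scaled by γ on either branch, equals the original LayerNorm output; only the tensor that gets quantized changes.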

In Fig. 5.10, the “Quant” process quantizes X. The quantized output then feeds two branches: the first performs the matrix multiplication on the bottom branch; the second multiplies by the parameter γ and goes through the “DeQuant” process. In effect, the γ computation is simply delayed from the LayerNorm to the shortcut branch, so the new design does not increase the computation overhead.
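
A minimal sketch of this flow is given below, assuming simple symmetric per-tensor fake quantization; the quantize helper, block_forward, and the scale arguments are illustrative assumptions rather than the chapter's actual kernels.

```python
import torch

def quantize(x, scale, num_bits=8):
    """Symmetric fake quantization to signed integers (illustrative helper)."""
    qmax = 2 ** (num_bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax)

def block_forward(x, w_q, w_scale, x_scale, gamma):
    """X is quantized once; the quantized tensor feeds both branches."""
    x_q = quantize(x, x_scale)          # "Quant" on the Non-scaling LayerNorm output X
    # Bottom branch: low-bit matrix multiplication with the quantized weight
    # (which already holds the folded-in γ), then dequantize with combined scales.
    bottom = (x_q @ w_q.t()) * (x_scale * w_scale)
    # Shortcut branch: multiply by γ, then "DeQuant" back to floating point;
    # γ is applied here instead of inside the LayerNorm.
    shortcut = (x_q * gamma) * x_scale
    return bottom, shortcut
```

The element-wise multiplication by γ on the shortcut branch replaces the one the original LayerNorm would have performed, which is why the migration adds no extra operations.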

FIGURE 5.10
Left: the quantization flow before gamma migration. Right: the flow with gamma migration.